CONVERGENCE OF GRADIENT-BASED ALGORITHMS FOR THE 
HARTREE-FOCK EQUATIONS 
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Abstract. The numerical solution of the Hartree-Fock equations is a central problem in quantum 
chemistry for which numerous algorithms exist. Attempts to justify these algorithms mathematically 
have been made, notably in [5], but, to our knowledge, no complete convergence proof has been pub- 
lished. In this paper, we prove the convergence of a natural gradient algorithm, using a gradient 
inequality for analytic functionals due to Lojasiewicz [15] ■ Then, expanding upon the analysis of [5], 
we prove convergence results for the Roothaan and Level-Shifting algorithms. In each case, our method 
of proof provides estimates on the convergence rate. We compare these with numerical results for the 
algorithms studied. 



1. Introduction 

In quantum chemistry, the Hartree-Fock method is one of the simplest approximations of the elec- 
tronic structure of a molecule. By assuming minimal correlation between the N electrons, it reduces 
Schrodinger's equation, a linear partial differential equation on M. 3N , to the Hartree-Fock equations, a 
system of N coupled nonlinear equations on R 3 . This approximation makes it much more tractable 
numerically. It is used both as a standalone description of the molecule and as a starting point for 
more advanced methods, such as the M0ller-Plesset perturbation theory, or multi-configuration meth- 
ods. Mathematically, the Hartree-Fock method leads to a coupled system of nonlinear integro-differential 
equations, which are discretized by expanding the solution on a finite Galerkin basis. The resulting non- 
linear algebraic equations are then solved iteratively, using a variety of algorithms, the convergence of 
which is the subject of this work. 

The mathematical structure of the Hartree-Fock equations was investigated in the 70 's, culminating 



in the proof of the existence of solutions by Lieb and Simon 13 , later generalized by Lions [14]. On 
the other hand, despite their ubiquitous use in computational chemistry, the convergence of the various 
algorithms used to solve them is still poorly understood. A major step forward in this direction is 
the recent work of Cances and Le Bris [,5]. Using the density matrix formulation, they provided a 
mathematical explanation for the oscillatory behavior observed in the simplest algorithm, the Roothaan 
method, and proposed the Optimal Damping Algorithm (ODA), a new algorithm inspired directly by 
the mathematical structure of the constraint set [4] . This algorithm was designed to decrease the energy 
at each step, and linking the energy decrease to the difference of iterates allowed the authors to prove 
that this algorithm "numerically converges" in the weak sense that \\Dk — -Dfc-i|| — > 0. In addition, the 
algorithm numerically converges towards an Aufbau solution m. This, to our knowledge, is the strongest 
convergence result available for an algorithm to solve the Hartree-Fock equations. 

However, this is still mathematically unsatisfactory, as it does not guarantee convergence, and merely 
prohibits fast divergence. The difficulty in proving convergence of the algorithms used to solve the 
Hartree-Fock equations lies in the lack of understanding of the second-order properties of the energy 
functional (for instance, there are no local uniqueness results available). In other domains, the con- 
vergence of gradient-based methods has been established using the Lojasiewicz inequality for analytic 



functionals [15] (see for instance 10 18]). This method of proof has the advantage of not requiring any 
second-order information. 

In this paper, we use a gradient descent algorithm to solve the Hartree-Fock equations. This algorithm 
builds upon ideas from differential geometry [8] and the various projected gradient algorithms used in 



the context of quantum chemistry 16 16 . To our knowledge, this particular algorithm has never been 



applied to the Hartree-Fock equations. Although it lacks the sophistication of modern minimization 
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methods (for instance, see the trust region methods of [9] and [TT]), it is the most natural generalization 
of the classical gradient descent, and, as such, the simplest one to analyze mathematically. For this 



algorithm, following the method of 18 , we prove convergence, and obtain explicit estimates on the 
convergence rate. We also apply the method to the widely used Roothaan and Level-Shifting algorithms, 
effectively linking these fixed-point algorithms to gradient methods. 

The structure of this paper is as follows. We first introduce the Hartree-Fock problem in the math- 
ematical setting of density matrices and prove a Lojasiewicz inequality on the constrained parameter 
space. We then introduce the gradient algorithm, and prove some estimates. We show the convergence 
and obtain convergence rates for this algorithm, then extend our method to the Roothaan and Level- 
Shifting algorithm, using an auxiliary energy functional following [5j. We finally test all these results 
numerically and compare the convergence of the algorithms. 

2. Setting 

We are concerned with the numerical solution of the Hartree-Fock equations. We will consider for 
simplicity of notation the spinless Hartree-Fock equations, where each orbital fa is a function in L 2 (E 3 , K), 
although our results are easily transposed to other variants such as General Hartree-Fock (GHF) and 
Restricted Hartree-Fock (RHF). 

In this paper, we consider a Galerkin discretization space with finite orthonormal basis (xi)i=i...N b - 
In this setting, the orbitals fa are expanded on this basis, and the operators we consider are JVj, x Nj, 
matrices. This finite dimension hypothesis is crucial for our results, and we comment on it in the 
conclusion. 

The Hartree-Fock problem consists in minimizing the total energy of a N-body system. We describe 
the mathematical structure of the energy functional and the minimization set, and introduce a natural 
gradient descent to solve this problem numerically. 

2.1. The energy. We consider the quantum N-body problem of N electrons in a potential V (in most 
applications, V is the Coulombic potential created by a molecule or atom). In the spinless Hartree- 
Fock model, this problem is simplified by assuming that the N-body wavefunction ip is a single Slater 
determinant of N L 2 -orthonormal orbitals fa . A simple calculation then shows that the energy of the 
wavefunction ip can be expressed as a function of the orbitals fa, 

m = Y.( W+/ V P+ \f I ^M dxdy, 
~[ Js? 2 JR3 2 JR3 Jr3 \x - y\ 



where r(x,y) = ^i=i4>i{x)fa{y) and p(x) = t(x,x). 

The energy is then to be minimized over all sets of orthonormal orbitals fa. An alternative way of 
looking at this problem is to reformulate it using the density operator D. This operator, defined by its 
integral kernel D(x,y) = r(x,y), can be seen to be the orthogonal projection on the space spanned by 
the fa's. The energy can be written as a function of D only: 

E{D) = Tv{{h+ l -G{D))D), (2.1) 



h=-\A + V, 
(G(D)fa(x)=( P *')(x)fax)- r*M*>v) 



where 



y \x-v\ 

This time, the energy is to be minimized over all orthogonal projection operators of rank N. In the 
discrete setting, the orbitals fa are discretized as fa = X)iSi c uX»i an d the operators D, h, and G(D) 
become Nb x Nb matrices. 

2.2. The manifold V . The Hartree-Fock energy is to be minimized over the set of density matrices 
D e V = {D e M Nb (R),D T = D,D 2 = D,TyD = N}. 
The manifold V is equipped with the canonical inner product in M^ b (M.) 

(A, B) = Tr(A T B). 

We denote by ||j4|| = \J (A, A) the Frobenius (or Hilbert-Schmidt) norm of A, which is the most natural 
here, and by ||A|| = maxj^n^x || Ax\\ the operator norm of A. 



The Ricmannian structure of V allows us to compute the gradient of E. The tangent space TdV at a 
point D is the set of A such that D + A verifies the constraints up to first order in A, that is, such that 
A T = A, DA + AD = A,Tr A = 0. Block-decomposing A on the two orthogonal spaces range(D) and 
ker(D), this implies that the tangent space T D V is the set of matrices A of the form 





where A is an arbitrary (Nf, — N) x N matrix. 

Hence, the projection on the tangent space of an arbitrary symmetric matrix M is given by 

P D {M) = DM (I - D) + (1 - D)MD 
= [D, [D,M]]. 

(B A T \ ( A T \ 

A J , then [D, M \ = \_ A q I and [A [A M \\ = 

which shows that || [D[D, M]] || = || [D, M] || , a property that will be useful in the sequel. 
We can now compute the gradient of E. First, the unconstrained gradient in M]y b (R) is 

VE(D) = F D = h + G(D), 

the Fock operator describing the mean field generated by the electrons of D. We obtain the constrained 
gradient \7-pE by projecting onto the tangent space: 

V V E(D) = P D (VE(D)) 
= [D,[D,F D ]]. 



2.3. tojasiewicz inequality. The Lojasiewicz inequality for a functional / around a critical point xq 
is a local inequality that provides a lower bound on V/. Its only hypothesis is analyticity. In particular, 
no second order information is needed, and the inequality accommodates degenerate critical points. 



2.3.1. The classical Lojasiewicz inequality. 

Theorem 2.1 (Lojasiewicz inequality in W 1 ). Let f be an analytic functional from R" to R. Then, for 
each x € K™, there is a neighborhood U of x and two constants n > 0, 9 £ (0,1/2] such that when 

x e U, 

This inequality is trivial when xq is not a critical point. When x is a critical point, a simple Taylor 
expansion shows that, if the Hessian V 2 /(x ) is invertible, we can choose 6 = 5 an d k > J- — , 

where Ai is the eigenvalue of lowest magnitude V 2 /(xo). When V 2 /(xo) is singular (meaning that xo 
is a degenerate critical point), the analyticity hypothesis ensures that the derivatives cannot all vanish 
simultaneously, and that there exists a differentiation order n such that the inequality holds with 6 = ^ . 



2.3.2. Lojasiewicz inequality on V . We now extend this inequality to functionals defined on the manifold 
V. 

Theorem 2.2 (Lojasiewicz inequality on V). Let f be an analytic functional from V to K. Then, for 
each D n £ V, there is a neighborhood U of D and two constants k > 0, 9 e (0,1/2] such that when 
DeU, 

\f(D)- fiDo)^ 9 <K\\W v f(D)\\. 

Proof. Let D G V. Define the map Rd from Tu Q V to V by 

R Do (A) = UD U T , 

L/ = exp(-[A),A]). 
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Figure 2.1. The map Rd, 



This map is analytic and verifies Rd o (0) = Dq, cLRd q (0) = idro -p. Therefore, by the inverse function 
theorem, the image of a neighborhood of zero contains a neighborhood of Dq. We now compute the 
gradient of / = / o Rjj B at a point A, with D — Rd (A). 

/(A + S) = f(D) + (X7 v f(D),dR Do (A)5) + 0(6 2 ) 
= f(D) + (dR Do (A)*\7 v f(D), S) + 0(5 2 ) 
= /(£>) + (P Do dR Do (A)*W v f(D),S) + 0(S 2 ). 

We deduce 

V Tdo -p/(A) = P Da dR Do (A)*W v f(D). 

We can now apply the Lojasiewicz inequality to /, which is an analytic functional defined on the 
Euclidean space To Q M.. We obtain a neighborhood of zero in Td "P, and therefore a neighborhood U of 
Do on which 

\f(D) - /(Do)] 1 ' 6 < k\\P Do dR Do (A)*V v f(D)\\ . 
As dR* Dg is continuous in A, up to a change in the neighborhood U and the constant k, 

\f(D)-f(D )\ 1 - 8 <K\\W v f(D)\\. 

□ 



3. The gradient method 
3.1. Description of the method. The gradient flow 

^ = -V V E{D) 

= -[D,[D,F D }} (3.1) 

is a way of minimizing the energy over the manifold V . This continuous flow was already used to solve 
the Hartree-Fock equations in [l] (although the authors used a formulation in terms of orbitals, whereas 
we use density matrices). 

The naive Euler discretization 

D k+1 =D k -t[D k ,[D k ,F k ]] 

is not suitable because it does not stay on V . A variety of approaches deal with this problem. One of the 
first algorithms to solve the Hartree-Fock equations [16] used a purification method to project D k+ i back 
onto V . More recently, an orthogonal projection on the convex hull of V was used for that purpose [6j. 
Although we focus in this paper on a different gradient method, such projection methods have the same 
behavior for small stepsizes and can be treated in the same framework, provided one can prove results 
similar to Lemmas 13. II and 13.21 below. 
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We look for A+i on a curve on V that is tangent to V-p-E(A)- A natural curve on V is the change 
of basis 



D'(t) = U t DUj, 
where U t is a smooth family of orthogonal matrices. If we take 

U t =exp{t[D,F D ]), 

we get 



dA 



-[D,[D,F D ] 



so D'(t) is a smooth curve on V, tangent to the gradient flow at t 
Our gradient method with a fixed step t is then 



with 



A+i = U k D k Ui 



U k = exp(t[D k ,F k }). 



(3.2) 
(3.3) 



This method, as well as various generalizations, is described in [8j. 

We now prove a number of lemmas which are the main ingredients of the convergence proof. First, we 
bound the second derivative of the energy to obtain quantitative estimates on the energy decrease, then 
we link the difference of iterates A+i — D k to the gradient V-p-E(A), and finally we use the Lojasiewicz 
inequality to establish convergence. 

3.2. Derivatives. We start from a point Do and compute the derivatives of E along the curve A = 
U t D U_ t . For ease of notation we will write e(t) = E(D t ), F t = F(D t ) and C t = [D t , F t ]. 



dD t 
dt 

d n D t 
dt n 



m D U_ t + U t D AU -> 



dt 

[C , A], 

d n-l 

[C , [C , . . . [C , D t ] 



dt 



de 

d7 



dc 

di 



d\ 
dt 2 



n times Co 

Tr(F t [C ,A]), 

-IIColl 2 , 

Tr {Ft [Co , [C , A]]) + Tr (G( [C , D t ] ) [C , D t ] ) . 



3.3. Control on the curvature. 

Lemma 3.1. There exists a > such that for every Dq and t, 

d 2 e 



dt 2 



(*) < "IIColl 



Proof. 



dt 2 



= Tr(F t [C , [C , A]]) + Tr(G([C , A])[C , A])- 



(3.4) 



First, since the Laplacian in F(D) acts on a finite dimensional space, we can bound F(D): 

1 ||-A|| op + 2(27V + Z)J|-A|| op 



\F{D)\\ op < - ||-A|| op +||y|| op + ||C(D)|| op 
< 



(3.5) 



by the Hardy inequality. Next, making use of the operator inequality Ty(AB) <\\A\\ \\B\\, we show that 



Tr(F t [C , [C , D t }}) < 2 ( ^||-A|| op + 2(2N + Z)yf— Aj[ op ) ||C | 



For the second term of (3.4 1, 



Tr(G([C , A]) [C , A]) <\\G([C , A])|| op Tr (| [C , D 

^ Tr (l [Co ' 



• 4,/' - Ail ..Tr i'! : 'C ,A 
<167vJ|-A|| op ||C || 2 . 



The result is now proved with 



a=\\-A\\ op + 4(6N+Z)J\\-A\\ op . 



3.4. Choice of the stepsize. We can now expand the energy: 

e (i)< e (0)-i||C || 2 + ^a||C || 2 . 



□ 



If we choose 



t<-, (3.6) 



a 

we obtain a decrease of the energy 

£ (t)<6(0)-/?||C || 2 (3.7) 

with /3 = t- fa > 0. 

The optimal choice of t with this bound on the curvature is t = with /3 = ~ . Of course it could be 
that the actual optimal t is different, and could vary wildly, which is why we will not consider optimal 
stepsizes. 

3.5. Study of A+i — A- We now prove that our iteration A+i = UkD^U^ coincides with an uncon- 
strained gradient method up to first order in t. 

We say that M £ o(\\ N\\) when for all e > 0, there is a neighborhood U of zero such that when N E U, 
\\M\\ < e\\N\\. Note that this neighborhood U is not allowed to depend on N, meaning that the resulting 
bound is uniform, which will allow us to manipulate the remainders more easily. 

Lemma 3.2. For any k, 

A+i = D k + t[C k ,Dk]+o(t\\C k \\). 



Proof. 



A+i - A - t[C k , A] = V -r [C k , [C fc , . . . [C k , A] 



71=2 



n times Ck 

oo 

|A+i - A -t[C k) A] || < t\\[C k , A] || ^^'llCfeU"- 1 

n=2 



□ 
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4. Convergence of the gradient algorithm 



Theorem 4.1 (Convergence of the gradient algorithm). Let Do £ V be any density matrix and Dk be the 

'k ■ 



sequences of iterates generated from D$ by Dk+i = U^D^Uj^ , with stepsize t < — . Then Dk converges 



towards a solution of the Hartree-Fock equations. 

Proof. The energy E(D) is bounded from below on V, and therefore E k converges to a limit E^. In 
the sequel we will work for convenience with E(D) — E(D) — E^ and drop the tildes. Immediately, 



summing (3.7 1 implies that C k converges to 0, and therefore so does Dk — Dk-i (this is what Cances 
and Le Bris call "numerical convergence" in [5]). Note that we only get that \ \Dk — Dk-i\\ 2 is summable, 
which alone is not enough to guarantee convergence (the harmonic series Xk — X)f=i V IS a simple 
counter-example). To obtain convergence, we shall use the Lojasiewicz inequality. 

Let us denote by T the level set T = {D E T',E(D) = 0}. It is non-empty and compact. We apply 
the Lojasiewicz inequality to every point of T to obtain a cover (Ui)i^x of T in which the Lojasiewicz 
inequality holds with constants Ki,9i. 

By compactness, we extract a finite subcover from the Ui, from which we deduce S > 0, re > and 
9 S (0, 1/2] such that whenever d(D,T) < 8, 

E{D) 1 - < k\\VvE{D)\\ = k\\[D,C d }\\ = k\\C d \\ ■ (4.1) 



(recall from Section 2.2 that || [D, [D, M]]\\ =\\[D, M] || for M symmetric) 

To apply the Lojasiewicz inequality to our iteration, it remains to show that d(Dk,T) tends to zero. 
Suppose this is not the case. Then we can extract a subsequence, still denoted by Dk, such that 
d(Dk,T) > e for some e > 0. By compactness of V we extract a subsequence that converges to a D 
such that d(D,T) > e and (by continuity) E(D) = 0, a contradiction. Therefore d(Dk,T) — > 0, and for 
k larger than some fco, 

E(Dk) 1 - 9 < K\\C k \\. (4.2) 

For k > ko, 



E(D k ) B - E(D k+1 ) e > _ g (E(D k ) - E(D k+1 )) 



^WM\ {E{Dk) - E{Dk+l)) 
> 6 A\c k \\ 

K 

> %\\Dk+i-Dk\\+o(\\D k +i-Dk\\) 

Kt 



hence 



%\\D k+x - D k \\ + o(\\D k +i - AH) < E{D k ) e - E{D k+1 ) 6 . (4.3) 

As the right-hand side is summable, so is the left-hand side, which implies that 5311-^fc+i ~ Dk\\ < 00. 
Dk is therefore Cauchy and converges to some limit Z?oo. Ck — > 0, so D^ is a critical point. 

Note that now that we know that D k converges to Doo, we can replace the 9 and re in this inequality 
by the ones obtained from the Lojasiewicz inequality around D^ only. □ 

Let 

OO 

e fc = - AH • 

l=k 

This is a (crude) measure of the error at iteration number k. In particular, \\D k — Ax>|| < efe- 

Theorem 4.2 (Convergence rate of the gradient algorithm). 

(1) If 9 = 1/2 (non- degenerate case), then for any v' < -^-5, there exists c > such that 

e k <c(l-u') k . (4.4) 

(2) If 6 7^ 1/2 (degenerate case), then there exists c > such that 

e k < ck'T^e . (4.5) 
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Proof. Summing (4.3 1 from I = k to oo with k > fcp, we obtain 



e fc + o(e fc ) < ^E(D k ) e 
up 

^e k + o(e k )) " <E{D k f- e 

Kt I 



< K\\C k \ 

Hence, 



< ~(e k ~ e fe+ i) + o(e k - e k+1 ) 



efc+i < e k ~ ve k e + o(e k 6 ), with 



t ep 

v = — — 

K \ Kt 

Two cases are to be distinguished. If 9 = |, the above inequality reduces to 

efc+i < (1 - f + o(l))e fc 

with {/ = tAt and the result follows. 

The case 6* ^ 1/2 is more involved. We define 

y k = ck~ p , 

which verifies 

y k+1 = c(k+l)- p 

= ck- p (l + l/k)- p 

k 

_i i 
> y k (l-pc py^) 

We set p = and c large enough so that c > and y k[) > e ko . We then prove by induction 

efe < z/fc for k > ko. The result follows by increasing c to ensure that e k < y k , for k < kg- D 

In the non-degenerate case 6=1/2 (which was the case for the numerical simulations we performed, 
see Section [7J, the convergence is asymptotically geometric with rate 1 — v , where 

With the choice t = — suggested by our bounds, the convergence rate is 

1 

V = 1 9 ■ 

5. Convergence of the Roothaan algorithm 

The Roothaan algorithm (also called simple SCF in the literature) is based on the observation that 
a minimizer D of the energy satisfies the Aufbau principle: D is the projector onto the space spanned 
by the eigenvectors associated with the first N eigenvalues of F(D). This suggests a simple fixed-point 
algorithm: take for D k+ i the projector onto the space spanned by the eigenvectors associated with the 
first N eigenvalues of F(D k ), and iterate. Unfortunately, this procedure does not always work: in some 
circumstances, oscillations between two states occur, and the algorithm never converges. This behavior 
was explained mathematically in [5 , where the authors notice that the Roothaan algorithm minimizes 
the bilinear functional 

E(D, D') = Tr(h(D + D')) + Tr (G(D)D') 

with respect to its first and second argument alternatively. Thanks to the Lojasiewicz inequality, we can 
improve on their result and prove the convergence or oscillation of the method. 

The bilinear functional verifies E(D, D') = E(D',D), E(D, D) = 2E(D). In fact, \E(-,-) is the 
symmetric bilinear form associated with the quadratic form E(-). In the following, we assume the 
uniform well-posedness hypothesis of [5], i.e. that there is a uniform gap of at least 7 > between the 
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eigenvalues number N and N + 1 of F(Dk). Under this assumption, it can be proven 3 that there is a 
decrease of the bilinear functional at each iteration 



E(D k+1 ,D k+2 ) = E(D k+2 ,D k+1 ) 
= mmE(D,D k+1 ) 

<E(D kl D k+1 )- 1 \\D k+2 -D k \\ 2 

Since E(-, •) is bounded from below on V x V, this immediately shows that D k — D k+2 — > 0, which 
shows that D 2k and D 2k+ \ converge up to extraction, which was noted in (5l. We now prove convergence 
of these two subsequences, again using the Lojasiewicz inequality. 

E(-,-) is defined on V x V, which inherits the Riemannian structure of V by the natural inner product 
((D 1 ,D' 1 ),(D 2 ,D' 2 )) = (D 1 ,D 2 ) + (D'^D^). In this setting, the gradient is 

V F(D n'\- ([D,F(D')]\ 
V VxV E(D,D)-^ [D ^ F{D)] j. 

and therefore, using the fact that D k+ i (resp. D k+2 ) and F{D k ) (resp F(D k+ i)) commute, 



\\V VxV E(D k ,D k+1 )\\ = \/\\[D k ,F(D k+1 )}\\ 2 +\\[D k+1 ,F(D k )}\\ 2 
= \\[D k ,F(D k+1 )}\\ 
= \\[D k -D k+2 ,F(D k+1 )]\\ 
<2\\F(D k+1 )\\ op \\D k+2 -D k \\ 

A trivial extension of Theorem |2.2| to the case of a functional defined on V x V shows that we can 
apply the Lojasiewicz inequality to E{-, ■). By the same compactness argument as before, the inequality 

E{D kl D k+1 ) 1 - 6 ' < K'\\V VxV E(D k ,D k+1 )\\ 

<2K'\\F(D k+1 )\\ op \\D k+2 - D k \\ 

holds for k large enough, with constants k! > and 9' £ (0, |]. 

The exact same reasoning as for the gradient algorithm proves the following theorems 

Theorem 5.1 (Convergence/oscillation of the Roothaan algorithm). Let Do £ V such that the sequence 
D k of iterates generated by the Roothaan algorithms verifies the uniform well-posedness hypothesis with 
uniform gap 7 > 0. Then the two subsequences D 2k and D 2k+ \ are convergent. If both have the same 
limit, then this limit is a solution of the Hartree-Fock equations. 

Theorem 5.2 (Convergence rate of the Roothaan algorithm). Let D k be the sequence of iterates gener- 
ated by a uniformly well-posed Roothaan algorithm, and let 

00 

e* = £llA+a - AH • 

l=k 

Then, 

(1) If 6' = 1/2 (non-degenerate case), then for any v' < RK ,2^ F \\i , where \\F\\ op is the uniform bound 



on F proved in (3.51, there exists c > such that 

e k <c(l-u') k . (5.1) 
(2) If 9' 7^ 1/2 (degenerate case), then there exists c > such that 

e k <ck"^. (5.2) 



6. Level-shifting 



The Level-Shifting algorithm was introduced in 19 as a way to avoid oscillation in self-consistent 
iterations. By shifting the N lowest energy levels (eigenvalues of F), one can force convergence, although 
denaturing the equations in the process. We now prove the convergence of this algorithm. 
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The same arguments as before apply to the functional 

E b (D, D') = Tr(h(D + D')) + Ti(G(D)D') + ||[Z> - D'f 

= Tr(h(D + D')) + Tr(G(D)D') -bTr(DD') + bN 



with associated Fock matrix F b (D) — F(D) — bD. The difference with the Roothaan algorithm is that 



for b large enough, there is a uniform gap 7 > 0, and D/. 
the following theorems 



Dk+\ converges to (51. Therefore, we have 



Theorem 6.1 (Convergence of the Level-Shifting algorithm). Let Dq 6 V . Then there exists bo > 
such that for every b > bo, the sequence Dk of iterates generated by the Level-Shifting algorithm with 
shift parameter b verifies the uniform well-posedness hypothesis with uniform gap 7 > and converges. 

Theorem 6.2 (Convergence rate of the Level-Shifting algorithm). Let Dk be the sequence of iterates 
generated by the Level-Shifting with shift parameter b > b Q , and let 

00 

e k = J2\\Di+2 ~ AH ■ 

l=k 



The 



(1) Lf 9' = 1/2 (non-degenerate case), then for any v' < - ,2|| > F b|j ; 



there exists c > such that 



e k < c(l - v') 



(6.1) 



(2) Lf 9' 7^ 1/2 (degenerate case), then there exists c > such that 

e' 

e k < ck i 3 ^ 7 . 



(6.2) 



We can use this result to heuristically predict the behavior of the algorithm when b is large. 7 b and 
F b \\ both scale as b for large values of b. Assuming non-degeneracy, we can take k' > 



1 op 



J — , where 



}H 2 , where H 1 = H VxV E(D°° , D°°) 



Ai is the eigenvalue of smallest magnitude of the Hessian Hi , 
11 1 1 2 

and Hi = H-p x -p\\D — D \\ (i) 00 ,!) 00 ). But H2 admits zero as an eigenvalue (for instance, note that 
, 2 

D — D is constant along the curve (D t ,D' t ) = {U t DUj ', U t D'U^), where U t is a family of orthogonal 

matrices), so that, when b goes to infinity, Ai tends to the eigenvalue of smallest magnitude of Hi 
restricted to the nullspace of H2, and therefore stays bounded. Therefore, v' scales as |, which suggests 
that b should not be too large for the algorithm to converge quickly. 



7. Numerical results 

We illustrate our results on atomic systems, using gaussian basis functions. The gradient method was 
implemented using the software Expokit [20] to compute matrix exponentials. In our computations, the 
cost of a gradient step is not much higher than a step of the Roothaan algorithm, since the limiting step 
is computing the Fock matrix, not the exponential. However, the situation may change if the Fock matrix 
is computed using linear scaling techniques. In this case, one can use more efficient ways of computing 
geodesies, as described in (8j. 

First, the Lojasiewicz inequality with exponent | was checked to hold in the molecular systems and 
basis sets we encountered, suggesting that the minimizers are non-degenerate. Consequently, we never 
encountered sublinear convergence of any algorithm. 

For a given molecular system and basis, we checked that the Level-Shifting algorithm converged as 
(1 — v) k , where v is asymptotically proportional to \, which we predicted theoretically in Section 6 (see 



Figure 7.1 ). This means that the estimates we used have at least the correct scaling behavior. 
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Figure 7.1. Convergence of the Level-Shifting algorithm. The convergence is linear 
until machine precision. The horizontal spacing of the curves reveals the asymptotic 
relationship v oc \. The system considered is the carbon atom [N = Z = 6), under the 
RHF formalism, using the 3-21G gaussian basis functions. 



Next, we compared the efficiency of the Roothaan algorithm and of the gradient algorithm, in the 
case where the Roothaan algorithm converges. Our analysis leads to the estimate v = ||2 for the 

° k ii** Hop 

Roothaan algorithm, and v = 4 ^ for the gradient algorithm with stepsize t = — . 

It is immediate to see that, up to a constant multiplicative factor, k' > n, 7 < ||F|j op and for the cases 
of interest a « ||-Fll op , so from our estimates we would expect the gradient algorithm to be faster than 
the Roothaan algorithm. However, in our tests the Roothaan algorithm was considerably faster than the 



gradient algorithm (see Figure 7.2 1. This conclusion holds even when the stepsize is adjusted at each 
iteration with a line search. 

The reason that the Roothaan algorithm performs better than expected is that the inequality 

\\[D k+2 -D k ,F(D k+1 )}\\ <2|| J F( J D fc+1 )|||| J D fe+2 - J D fe || 

is very far from optimal. Whether an improved bound (in particular, one that does not depend on the 
dimension) can be derived is an interesting open question. 

The outcome of these tests seems to be that the gradient algorithm is slower. It might prove to be 
faster in situations where the gap is small, or whenever k' is much larger than k. We have been unable 
to find concrete examples of such cases. 

8. Conclusion, perspectives 

In this paper, we introduced an algorithm based on the idea of gradient descent. By using the 
analyticity of the objective function and of the constraint manifold, we were able to derive a Lojasiewicz 
inequality, and use that to prove the convergence of the gradient method, under the assumption of a 
small enough stepsize. Next, expanding on the analysis of [5], we extended the Lojasiewicz inequality to 
a Lyapunov function for the Roothaan algorithm. By linking the gradient of this Lyapunov function to 
the difference in the iterates of the algorithm, we proved convergence (or oscillation), an improvement 
over previous results which only prove a weaker version of this. In this framework, the Level-Shifting 
algorithm can be seen as a simple modification of the Roothaan algorithm, and as such can be treated 
by the same methods. In each case, we were also able to derive explicit bounds on the convergence rates. 

The strength of the Lojasiewicz inequality is that no higher-order hypothesis are needed for its use. 
As a consequence, the rates of convergence we obtain weaken considerably if the algorithm converges to 
a degenerate critical point. A more precise study of the local structure of critical points is necessary to 

11 




n 



Figure 7.2. Comparison of Roothaan and gradient algorithm. The system considered 
is the carbon atom (N = Z — 6), under the RHF formalism, using the 3-21G gaussian 
basis functions. 



understand why the algorithms usually exhibit geometric convergence. This is related to the problem of 
local uniqueness and is likely to be hard (and, indeed, to our knowledge has not been tackled yet). 

Even though our results hide the complexity of the local structure in the constants of the Lojasiewicz 
inequality, they still provide insight as to the influence of the basis on the speed of convergence, and can be 
used to compare algorithms. All of our results use crucially the hypothesis of a finite-dimensional Galerkin 
space. For the gradient algorithm, we need it to ensure the existence of a stepsize that decreases the 
energy. This is analogous to a CFL condition for the discretization of the equation ^ = —[£), [D, Fn}}, 
and can only be lifted with a more implicit discretization of this equation. For the Roothaan and Level- 
Shifting algorithms, we use the finite dimension hypothesis to bound F(D). As noted in Section [7] 
the inequality is not sharp, so it could be that the infinite-dimensional version of the Roothaan and 
Level-Shifting algorithms still converge. More research is needed to examine this. 

The gradient algorithm we examined only converges towards a stationary point of the energy, that 
may not be a local minimizer, or even an Aufbau solution. However, it will generically converge towards 
a local minimizer, unlike the Level-Shifting algorithm with large b. Therefore, it is the most robust of the 
algorithms considered. Although it was found to be slower than other algorithms on the numerical tests 
we performed, it has the advantage that its convergence rate does not depend on the gap \n+i — ^n, 
and might therefore prove useful in extreme situations. 

An algorithm that could achieve the speed of the fixed-point algorithms with the robustness granted 
by the energy monotonicity seems to be the ODA algorithm of Cances and Le Bris [5j, along with variants 
such as EDIIS, or combinations of EDIIS and DIIS algorithms [12]. We were not able to examine these 
algorithms in this paper. At first glance, the ODA algorithm should fit into our framework (indeed, the 
ODA algorithm was built to satisfy an energy decrease inequality similar to (3.7)). However, it works in 
a relaxed parameter space P, and using the commutator to control the differences of iterates as we did 
only makes sense on V. Therefore, other arguments have to be used. 

A variant on the gradient algorithm used here is to modify the local geometry of the manifold V by 
using a different inner product, leading to a variety of methods, including conjugate gradient algorithms 
[8]. These methods fit into our framework, as long as one can prove that they are "gradient-like", in 
the sense that one can control the gradient by the difference D^ + i — D^. However, precise estimates of 
convergence rates might be hard to obtain. 

Also missing from the present contribution is the study of other commonly used algorithms, such as 



DIIS 17 , and variants of (quasi)-Newton algorithms [2j,[TT] . DIIS numerically exhibits a complicated 
behavior that is probably hard to explain analytically, and the (quasi)-Newton algorithms require a study 
of the second-order structure of the critical points, which we are unable to do. 
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